{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Name: Md Mintu Miah, ID: 1001405116" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# IMDB-sentiment Analysis Using Naive Bayes Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text classification assigns tags or categories to text according to its content. In this analysis, the data set is a collection of 50,000 reviews from IMDB. I took the processed data from https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data and the original data is available at http://ai.stanford.edu/~amaas/data/sentiment/. The purpose of this analysis is to explore naive Bayes classification with text data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import the data and explore the contents" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import the libraries\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Import the data and see the data type" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production. <br /><br />The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data=pd.read_csv('C:/Users/mxm5116/Desktop/Data Mining/IMDB Dataset.csv')\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(50000, 2)\n" ] } ], "source": [ "# Check the shape of the data\n", "print(data.shape)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "count 50000 50000\n", "unique 49582 2\n", "top Loved today's show!!! It was a variety and not... positive\n", "freq 5 25000" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's see the summary of the data set\n", "data.describe()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 25000\n", "negative 25000\n", "Name: sentiment, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the number of positive and negative sentiments\n", "data['sentiment'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# a. Divide the dataset into train and test data sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First clean and normalize the data, then divide it into normalized train and test data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Now clean the text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Import library\n", "from bs4 import BeautifulSoup\n", "import re,string,unicodedata\n", "# Remove the HTML markup\n", "def strip_html(text):\n", "    soup = BeautifulSoup(text, \"html.parser\")\n", "    return soup.get_text()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Remove text enclosed in square brackets\n", "def remove_between_square_brackets(text):\n", "    return re.sub(r'\\[[^]]*\\]', '', text)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Remove the noisy text\n", "def denoise_text(text):\n", "    text = strip_html(text)\n", "    text = remove_between_square_brackets(text)\n", "    return text\n", "# Apply the function to the review column\n", "data['review']=data['review'].apply(denoise_text)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": 
[], "source": [ "# Now remove special characters and apply the function to the review column\n", "def remove_special_characters(text, remove_digits=True):\n", "    # Note: the range A-z (rather than A-Z) also matches [, ], ^, _ and `,\n", "    # so underscores and a few other symbols are kept\n", "    pattern=r'[^a-zA-z0-9\\s]'\n", "    text=re.sub(pattern,'',text)\n", "    return text\n", "data['review']=data['review'].apply(remove_special_characters)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Stemming the text\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import nltk\n", "def simple_stemmer(text):\n", "    ps=nltk.porter.PorterStemmer()\n", "    text= ' '.join([ps.stem(word) for word in text.split()])\n", "    return text\n", "# Apply the function to the review column\n", "data['review']=data['review'].apply(simple_stemmer)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "0 one of the other review ha mention that after ... positive\n", "1 A wonder littl product the film techniqu is ve... positive\n", "2 I thought thi wa a wonder way to spend time on... positive\n", "3 basic there a famili where a littl boy jake th... negative\n", "4 petter mattei love in the time of money is a v... positive" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment score\n", "0 one of the other review ha mention that after ... positive 1\n", "1 A wonder littl product the film techniqu is ve... positive 1\n", "2 I thought thi wa a wonder way to spend time on... positive 1\n", "3 basic there a famili where a littl boy jake th... negative 0\n", "4 petter mattei love in the time of money is a v... positive 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert the sentiment labels to numeric scores: positive=1, negative=0\n", "def posneg(x):\n", "    if x==\"negative\":\n", "        return 0\n", "    elif x==\"positive\":\n", "        return 1\n", "    return x\n", "\n", "filtered_score = data[\"sentiment\"].map(posneg)\n", "data[\"score\"] = filtered_score\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(40000,)\n", "(10000,)\n", "(40000,)\n", "(10000,)\n" ] } ], "source": [ "# Data preparation for the model\n", "from sklearn.model_selection import KFold, cross_val_score, train_test_split\n", "X = data['review'].values\n", "y = data['sentiment'].values\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "print(X_train.shape)\n", "print(X_test.shape)\n", "print(y_train.shape)\n", "print(y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# b.\tBuild a vocabulary as list. 
\n", "\t [‘the’ ‘I’ ‘happy’ … ] \n", "# You may omit rare words for example if the occurrence is less than five times\n", "# A reverse index as the key value might be handy\n", " {“the”: 0, “I”:1, “happy”:2 , … }\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "train_voca='.'.join(X_train)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "test_voca='.'.join(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to\n", "[nltk_data] C:\\Users\\mxm5116\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 136293)\t8\n", " (0, 150904)\t3\n", " (0, 67766)\t5\n", " (0, 75135)\t1\n", " (0, 11562)\t1\n", " (0, 92885)\t1\n", " (0, 41874)\t3\n", " (0, 136505)\t23\n", " (0, 84039)\t3\n", " (0, 50335)\t1\n", " (0, 119747)\t1\n", " (0, 85233)\t1\n", " (0, 133713)\t1\n", " (0, 8639)\t11\n", " (0, 56257)\t1\n", " (0, 85626)\t1\n", " (0, 103934)\t1\n", " (0, 3345)\t1\n", " (0, 88547)\t1\n", " (0, 29843)\t1\n", " (0, 7852)\t1\n", " (0, 129385)\t2\n", " (0, 145433)\t1\n", " (0, 151082)\t1\n", " (0, 155151)\t2\n", " :\t:\n", " (39999, 4780)\t3\n", " (39999, 110500)\t1\n", " (39999, 25078)\t1\n", " (39999, 140945)\t1\n", " (39999, 69211)\t1\n", " (39999, 35608)\t1\n", " (39999, 73389)\t1\n", " (39999, 21410)\t1\n", " (39999, 101470)\t1\n", " (39999, 37086)\t1\n", " (39999, 138509)\t1\n", " (39999, 64282)\t1\n", " (39999, 53674)\t1\n", " (39999, 
31076)\t1\n", " (39999, 70371)\t1\n", " (39999, 48701)\t1\n", " (39999, 108453)\t1\n", " (39999, 118323)\t1\n", " (39999, 47309)\t1\n", " (39999, 26024)\t1\n", " (39999, 85408)\t1\n", " (39999, 135801)\t1\n", " (39999, 59687)\t1\n", " (39999, 37944)\t1\n", " (39999, 29741)\t1\n" ] }, { "data": { "text/plain": [ "(40000, 156180)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foovec = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)\n", "train_counts = foovec.fit_transform(X_train)\n", "print(train_counts)\n", "train_counts.shape" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 1)", "output_type": "error", "traceback": [ "\u001b[1;36m File \u001b[1;32m\"\"\u001b[1;36m, line \u001b[1;32m1\u001b[0m\n\u001b[1;33m foovec.vocabulary_(1:200)\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n" ] } ], "source": [ "foovec.vocabulary_" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "156180\n" ] } ], "source": [ "from os import listdir\n", "from collections import Counter\n", "# print the size of the vocab\n", "print(len(foovec.vocabulary_))\n" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['what', 'i', 'kept', 'ask', 'myself', 'dure', 'the', 'mani', 'fight', 'scream', 'match', 'swear', 'and', 'gener', 'mayhem', 'permeat', '84', 'minut', 'comparison', 'also', 'stand', 'up', 'when', 'you', 'think', 'of', 'onedimension', 'charact', 'who', 'have', 'so', 'littl', 'depth', 'it', 'is', 'virtual', 'imposs', 'to', 'care', 'happen', 'them', 'they', 'are', 'just', 'badli', 'written', 'cypher', 'for', 'director', 'hang', 'hi', 'multicultur', 'belief', 'on', 'a', 'topic', 'ha', 'been', 'done', 'much', 'better', 
'in', 'other', 'drama', 'both', 'tv', 'cinemai', 'must', 'confess', 'im', 'not', 'realli', 'one', 'spot', 'bad', 'perform', 'film', 'but', 'be', 'said', 'nichola', 'burley', 'as', 'heroin', 'slutti', 'best', 'friend', 'wasim', 'zakir', 'nasti', 'bulli', 'brother', 'were', 'absolut', 'terribl', 'dont', 'know', 'act', 'school', 'graduat', 'from', 'if', 'wa', 'id', 'appli', 'full', 'refund', 'post', 'hast', 'onli', 'samina', 'awan', 'lead', 'role', 'manag', 'impress', 'cast', 'socal', 'british', 'talent', 'well', 'probabl', 'never', 'hear', 'again', 'at', 'least', 'hope', 'next', 'time', 'hire', 'differ', 'scoutanoth', 'intrigu', 'thought', 'hideous', 'fashion', 'soundtrack', 'featur', 'like', 'snow', 'patrol', 'ian', 'brown', 'kean', 'now', 'bit', 'music', 'fan', 'familiar', 'with', 'most', 'these', 'artist', 'output', 'didnt', 'recognis', 'ani', 'track', 'thi', 'movi', 'apart', 'omnipres', 'run', 'bside', 'anyon', 'we', 'get', 'montag', 'which', 'telegraph', 'how', 'suppos', 'feel', 'accompani', 'by', 'such', 'startlingli', 'origin', 'imag', 'coupl', 'kiss', 'swollen', 'lake', 'canoodl', 'doorway', 'problem', 'none', 'song', 'convey', 'mood', 'effici', 'realis', 'lack', 'abil', 'carri', 'emot', 'journey', 'audienc', 'through', 'storytel', 'dialogu', 'aloneth', 'end', 'presum', 'meant', 'dessert', 'everybodi', 'their', 'comeupp', 'there', 'big', 'shock', 'store', 'remain', 'resolut', 'unmov', 'becaus', 'script', 'had', 'given', 'me', 'noon', 'root', 'enough', 'tackl', 'hotbutton', 'issu', 'actual', 'give', 'us', 'plot', 'hasnt', 'alreadi', 'death', 'individu', 'more', 'than', 'window', 'dress', 'nobl', 'failur', 'promis', 'actress', 'few', 'mildli', 'divert', 'punchup', 'save', 'bin', '410', 'tri', 'harder', 'did', 'watch', 'entir', 'could', 'stop', 'dvd', 'after', 'half', 'an', 'hour', 'suggest', 'themselv', 'befor', 'take', 'disc', 'out', 'casei', 'mafia', 'tragic', 'comic', 'corki', 'romano', 'can', 'describ', 'attempt', 'comedyth', 'simpli', 'too', 'hard', 
'laugh', 'seem', 'excus', 'move', 'chri', 'kattan', 'scene', 'anoth', 'himself', 'complet', 'overplay', 'subtleti', 'or', 'credul', 'all', 'strang', 'manner', 'come', 'across', 'contriv', 'clearli', 'rather', 'bounc', 'right', 'stori', 'each', 'utterli', 'predict', 'comed', 'event', 'will', 'occur', 'set', 'obviou', 'soon', 'introduc', 'comedi', 'mr', 'bean', 'disast', 'caus', 'titl', 'funni', 'empathis', 'motiv', 'initi', 'situat', 'howev', 'he', 'deliber', 'screw', 'desper', 'draw', 'audienceif', 'play', 'alien', 'connect', 'whose', 'behaviour', 'inexplic', 'except', 'werent', 'stereotyp', 'joke', 'far', 'watchabl', 'isnt', 'touch', 'love', 'reminisc', 'heavili', 'chines', 'poetri', 'use', 'eastern', 'peopl', 'commun', 'focus', 'schoolteach', 'want', 'model', 'teacher', 'good', 'husband', 'father', 'senior', 'student', 'veri', 'attract', 'him', 'unfold', 'see', 'below', 'surfac', '20', 'year', 'marriag', 'grappl', 'moral', 'dilemma', 'face', 'beauti', 'latterday', 'fulci', 'schlocker', 'total', 'abysm', 'concoct', 'deal', 'incur', 'gambler', 'brett', 'halsey', 'decid', 'bluebeardstyl', 'pay', 'off', 'everris', 'debt', 'seduc', 'some', 'ugliest', 'bitch', 'ever', 'lay', 'your', 'eye', 'wealthi', 'widow', 'fulcipen', 'incorpor', 'blackli', 'element', 'result', 'unfunni', 'busi', 'involv', 'corps', 'wont', 'stay', 'put', 'opera', 'singer', 'victim', 'sing', 'etc', 'mention', 'doppelgang', 'theme', 'straight', 'pragu', 'although', 'case', 'two', 'persona', 'via', 'prerecord', 'radio', 'messag', 'cant', 'say', 'surpris', 'show', 'no', 'sign', 'sophist', 'mario', 'bava', 'hatchet', 'honeymoon', '1970', 'resembl', 'sever', 'way', 'content', 'mere', 'pile', 'disgustingli', 'gori', 'nonetooconvinc', 'effect', 'dismemb', 'limb', 'squash', 'melt', 'ala', 'then', 'becom', 'associ', 'first', 'firmli', 'believ', 'norwegian', 'continu', 'tediou', '70', '80', 'place', 'start', 'contain', 'humour', 'imagin', 'made', 'entertain', 'oppos', 'long', 'dark', 'depress', 'boringdur', 
'90', 'great', 'new', 'filmmak', 'prais', 'critic', 'load', 'money', 'becam', 'normthen', 'came', 'unitedminor', 'spoiler', 'onc', 'thing', 'especi', 'comedian', 'neither', 'nor', 'do', 'anyth', 'where', 'humor', 'awkward', 'clerk', 'harald', 'eia', 'overact', 'ridicul', 'unrealist', 'footbal', 'coach', 'commentari', 'arn', 'scheie', 'funnybut', 'my', 'main', 'rant', 'about', 'unit', 'name', 'here', 'fear', 'standstil', 'sinc', 'seen', 'go', 'exactli', 'present', 'deserv', 'room', 'allal', 'sat', 'realiz', 'need', 'blood', 'make', 'againr', '16', 'receiv', 'posit', 'review', 'site', 'vonnegut', 'am', 'showtim', 'bastard', 'beyond', 'even', 'wasnt', 'poor', 'sean', 'astin', 'brilliant', 'athlet', 'around', 'harrison', 'guy', 'substandard', 'write', 'render', 'tripe', 'bare', 'someon', 'point', 'cute', 'maculay', 'culkin', 'line', 'read', 'pure', 'brillianc', 'sadli', 'intent', 'part', 'mayb', 'youll', 'insan', 'pleas', 'nightmar', 'weekend', 'star', 'actor', 'less', 'idea', 'decipher', 'special', 'sound', 'direct', 'henri', 'sala', 'reason', 'alertsoooo', 'arni', 'incid', 'helicopt', 'disobey', 'order', 'sent', 'jail', 'sort', 'work', 'camp', 'escap', 'short', 'while', 'caught', 'freakish', 'realiti', 'bunch', 'tough', 'eventu', 'die', 'tougher', 'toughest', 'guysi', 'arniefan', 'man', 'flaw', 'annoy', 'crap', 'eg', 'reconstruct', 'insid', 'summari', '510', 'camera', 'angl', 'mean', 'militari', 'flew', 'equip', 'almost', '10', 'crew', 'member', '_inside_', 'beatsther', 'theori', 'interest', 'innov', 'creat', 'pool', 'stupid', 'unreal', 'drownsth', 'sub', 'par', 'averag', 'rest', 'without', 'badth', 'ok', 'impressiver', '310', 'badmouth', 'those', 'understand', 'begin', 'blockbust', 'advers', 'doesnt', 'leonardo', 'dicaprio', 'wilder', 'napalm', 'neat', 'may', 'quirki', 'substanceon', 'particular', 'larg', 'notic', 'import', 'vida', 'life', 'background', 'wallac', 'heard', 'open', 'sequenc', 'lyric', 'instanc', 'men', 'duke', 'earl', 'someth', 'she', 'girl', 'goe', 
'over', 'cleverli', 'tension', 'between', 'intricaci', 'look', 'flop', 'outsid', 'real', 'usual', 'forward', 'tvfilm', 'favourit', 'subject', 'mine', 'nice', 'chang', 'documentari', 'kursk', 'stalingrad', 'histori', 'chann', 'avidli', 'pearl', 'harbour', 'enemi', 'gate', 'rude', 'brought', 'down', 'earth', 'malevol', 'stupidifi', 'power', 'hollywood', 'spend', 'fortun', 'tripeso', 'yet', 'got', 'excit', 'rise', 'evil', 'kershaw', 'ive', 'enjoy', 'book', 'whi', 'quitto', 'quot', 'respons', 'rubbishth', 'academ', 'piec', 'wasquit', 'dri', 'nut', 'hitler', 'ye', 'volum', 'biographi', 'detail', 'beth', 'thesi', 'behind', 'behitl', 'hate', 'jew', 'miss', 'emphasis', 'fact', 'everi', 'filmther', 'effort', 'whatsoev', 'explain', 'adopt', 'view', 'strategi', 'needless', 'unlik', 'excel', 'nazi', 'warn', 'neglect', 'nearli', 'leader', 'munich', 'communist', 'jewish', 'colour', 'axiomat', 'link', 'bolshev', 'crucial', 'aspect', 'erabut', 'stuff', 'knew', 'anyway', 'certainli', 'fascin', 'allud', 'briefli', 'socialistcommunist', 'immedi', 'ww1', 'would', 'cours', 'complex', 'handl', 'might', 'detract', 'relentless', 'mantra', 'bang', 'away', 'incessantlyw', 'mesmeris', 'figur', 'public', 'speaker', 'privat', 'polit', 'sympathet', 'espous', 'vegetarian', 'antialcohol', 'antismok', 'guardian', 'reader', 'agre', 'famous', 'fond', 'anim', 'henc', 'wholli', 'invent', 'dogflog', 'absurdh', 'account', 'brave', 'soldier', 'whilst', 'saw', 'iron', 'cross', 'won', 'braveri', 'insight', 'into', 'fire', 'war', 'experi', 'sassoon', 'owen', 'brook', 'remarqu', 'found', 'repel', 'abov', 're', 'jewishbolshevik', 'vital', 'alway', 'despit', 'massiv', 'evid', 'contrari', 'colleagu', 'still', 'drew', 'wrong', 'conclusionsthi', 'eithera', 'often', 'day', 'classic', 'exampl', 'relev', 'leav', 'fit', 'cater', 'lowest', 'common', 'denomin', 'trust', 'inch', 'ram', 'throat', 'correctli', 'dumb', 'fool', 'worldhistori', 'past', 'our', 'wors', 'rubbish', 'opportun', 'lost', 'spent', 'million', 
'locat', 'told', 'noth', 'promot', 'period', 'human', 'historywt', '20minut', 'liber', 'fastforward', 'button', 'shot', 'stewart', 'michael', 'zelnik', 'walk', 'hallway', 'door', 'street', 'pensiv', 'confus', 'gave', '2030', 'stretch', 'labour', 'griev', 'cowrot', 'screenplayit', 'hadnt', 'disappointingli', 'three', 'atyp', 'independentsmal', 'studio', 'heart', 'standard', 'formula', 'manipul', 'nonsens', 'cheap', 'corni', 'bore', 'slow', 'pace', 'earli', 'horror', 'rent', 'sens', 'famili', 'live', 'wood', 'invit', 'son', 'wife', 'daughter', 'holiday', 'mother', 'law', 'along', 'until', 'till', 'form', 'esp', 'flashback', 'catastroph', 'unfortun', 'clue', 'bright', 'light', 'signal', 'approach', 'interpret', 'darth', 'vadar', 'voic', 'stolen', 'variou', 'final', 'find', 'killer', 'turn', 'kind', 'japanes', 'warrior', 'ww2', 'appar', 'back', 'claim', 'her', 'doe', 'front', 'shake', 'hand', 'convuls', 'pathet']\n" ] } ], "source": [ "# You may omit rare words, for example if the occurrence is less than five times\n", "# Note: vocabulary_ maps each token to its column index, not to its frequency,\n", "# so this comparison drops the tokens with an index below 5 rather than the rare words\n", "min_occurane = 5\n", "tokens = [k for k,c in foovec.vocabulary_.items() if c >= min_occurane]\n", "print(tokens[1:1000])\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "156175\n" ] } ], "source": [ "print(len(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Before filtering, the vocabulary had 156180 words; after filtering it has 156175. Because `vocabulary_` stores column indices rather than counts, the filter removed only the five words with an index below 5, not the rare words. Setting `min_df=5` in `CountVectorizer` would instead drop words that appear in fewer than five documents." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# c.\tCalculate the following probability\n", "\tProbability of the occurrence\n", "•\tP[“the”] = num of documents containing ‘the’ / num of all documents\n", "\tConditional probability based on the sentiment\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "39815\n" ] } ], "source": [ "# Count the training documents that contain \"the\"\n", "# Note: `word in sentence` is a substring test, so documents containing e.g. 'there' or 'other' are also counted\n", "words=[\"the\"]\n", "sentences = X_train\n", "count=0\n", "for sentence in sentences:\n", "    for word in words:\n", "        if word in sentence:\n", "            count=count+1\n", "num_of_documents_containing_the=count\n", "print(num_of_documents_containing_the)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40000\n" ] } ], "source": [ "# Total number of training documents\n", "num_of_all_documents=len(X_train)\n", "print(num_of_all_documents)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.995375\n" ] } ], "source": [ "Probability_of_the=num_of_documents_containing_the/num_of_all_documents\n", "print(Probability_of_the)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# •\tP[“the” | Positive] = # of positive documents containing “the” / num of all positive review documents" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment score\n", "0 one of the other review ha mention that after ... positive 1\n", "1 A wonder littl product the film techniqu is ve... positive 1\n", "2 I thought thi wa a wonder way to spend time on... positive 1\n", "3 basic there a famili where a littl boy jake th... negative 0\n", "4 petter mattei love in the time of money is a v... positive 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now take the positive sentiment data\n", "# Note: this slice takes the first 4000 rows of the full data set, not the rows of X_train\n", "train_data=data[:4000]\n", "# Note: `sentiment` holds the strings 'positive' and 'negative', so `!= 0` keeps every row;\n", "# filtering on train_data['score'] == 1 would keep only the positive reviews\n", "positive_docs=train_data.loc[train_data['sentiment']!=0]\n", "positive_docs.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec A master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear It play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done',\n", " 'I thought thi wa a wonder way to spend time on a too hot summer weekend sit in the air condit theater and watch a lightheart comedi the plot is simplist but the dialogu is witti and the charact are likabl even the well bread suspect serial killer while some may be disappoint when they realiz thi is not match point 2 risk addict I thought it wa proof that woodi allen is still 
fulli in control of the style mani of us have grown to lovethi wa the most Id laugh at one of woodi comedi in year dare I say a decad while ive never been impress with scarlet johanson in thi she manag to tone down her sexi imag and jump right into a averag but spirit young womanthi may not be the crown jewel of hi career but it wa wittier than devil wear prada and more interest than superman a great comedi to go see with friend',\n", " 'basic there a famili where a littl boy jake think there a zombi in hi closet hi parent are fight all the timethi movi is slower than a soap opera and suddenli jake decid to becom rambo and kill the zombieok first of all when your go to make a film you must decid if it a thriller or a drama As a drama the movi is watchabl parent are divorc argu like in real life and then we have jake with hi closet which total ruin all the film I expect to see a boogeyman similar movi and instead i watch a drama with some meaningless thriller spots3 out of 10 just for the well play parent descent dialog As for the shot with jake just ignor them',\n", " 'petter mattei love in the time of money is a visual stun film to watch Mr mattei offer us a vivid portrait about human relat thi is a movi that seem to be tell us what money power and success do to peopl in the differ situat we encount thi be a variat on the arthur schnitzler play about the same theme the director transfer the action to the present time new york where all these differ charact meet and connect each one is connect in one way or anoth to the next person but no one seem to know the previou point of contact stylishli the film ha a sophist luxuri look We are taken to see how these peopl live and the world they live in their own habitatth onli thing one get out of all these soul in the pictur is the differ stage of loneli each one inhabit A big citi is not exactli the best place in which human relat find sincer fulfil as one discern is the case with most of the peopl we encounterth act is 
good under Mr mattei direct steve buscemi rosario dawson carol kane michael imperioli adrian grenier and the rest of the talent cast make these charact come alivew wish Mr mattei good luck and await anxious for hi next work']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a list of the positive reviews\n", "train_pos_reviews=positive_docs['review']\n", "train_pos_voca=train_pos_reviews.values.tolist()\n", "train_pos_voca[1:5]" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "# Join the positive reviews with a single dot\n", "train_pos_voca='.'.join(train_pos_voca)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3978\n" ] } ], "source": [ "# Now count the positive documents that contain \"the\"\n", "# Iterate over the list of reviews, not the joined string, so that each item is one document\n", "words=[\"the\"]\n", "sentences = positive_docs['review'].tolist()\n", "count=0\n", "for sentence in sentences:\n", "    for word in words:\n", "        if word in sentence:\n", "            count=count+1\n", "num_of_pos_documents_containing_the=count\n", "print(num_of_pos_documents_containing_the)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4000\n" ] } ], "source": [ "# Find the total number of positive documents\n", "num_of_all_pos_documents=positive_docs['review'].count()\n", "print(num_of_all_pos_documents)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9945\n" ] } ], "source": [ "# Now calculate P[“the” | Positive] = # of positive documents containing “the” / num of all positive review documents\n", "probability_0f_the_in_positive_docs=num_of_pos_documents_containing_the/num_of_all_pos_documents\n", "print(probability_0f_the_in_positive_docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# d.\tCalculate accuracy using dev dataset \n", "\t# Conduct five fold cross validation\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert the data to vector format\n", "tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))\n", "tf_idf_train = tf_idf_vect.fit_transform(X_train)\n", "tf_idf_test = tf_idf_vect.transform(X_test)\n", "\n", "alpha_range = list(np.arange(0,10,1))\n", "len(alpha_range)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n", " 'setting alpha = %.1e' % _ALPHA_MIN)\n", "C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n", " 'setting alpha = %.1e' % _ALPHA_MIN)\n", "C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n", " 'setting alpha = %.1e' % _ALPHA_MIN)\n", "C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n", " 'setting alpha = %.1e' % _ALPHA_MIN)\n", "C:\\Users\\mxm5116\\Anaconda3\\lib\\site-packages\\sklearn\\naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10\n", " 'setting alpha = %.1e' % _ALPHA_MIN)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 0.8233\n", "1 0.8845749999999999\n", "2 0.879425\n", "3 0.8753749999999998\n", "4 0.8727500000000001\n", "5 0.8703\n", "6 0.8679499999999999\n", "7 0.86595\n", "8 0.8638\n", "9 0.86205\n" ] } ], "source": [ "# Try different values of alpha in cross validation and record the mean accuracy score\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "alpha_scores=[]\n", "\n", "for a in alpha_range:\n", "    clf = MultinomialNB(alpha=a)\n", "    scores = cross_val_score(clf, tf_idf_train, y_train, cv=5, scoring='accuracy')\n", "    alpha_scores.append(scores.mean())\n", "    print(a,scores.mean())" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Misclassification error is 1 minus the mean CV accuracy\n", "import matplotlib.pyplot as plt\n", "\n", "MSE = [1 - x for x in alpha_scores]\n", "\n", "\n", "optimal_alpha_bnb = alpha_range[MSE.index(min(MSE))]\n", "\n", "# plot misclassification error vs alpha\n", "plt.plot(alpha_range, MSE)\n", "\n", "plt.xlabel('hyperparameter alpha')\n", "plt.ylabel('Misclassification Error')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "optimal_alpha_bnb\n", "\n", "# alpha=1 gives the minimum misclassification error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# e.\tDo following experiments\n", "\tCompare the effect of Smoothing\n", "\tDerive Top 10 words that predicts positive and negative class \n", " •\tP[Positive| word] \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Effects of non-smoothing and smoothing " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The cross-validation above already shows the effect of smoothing. With alpha=0 (no smoothing) the accuracy was 82.33%, whereas every smoothed setting (alpha from 1 to 9) gave a higher accuracy, between 86.2% and 88.5%. 
The best smoothing parameter is alpha=1, with the highest accuracy of 88.46%." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\mxm5116\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's look for the positive and negative words with the strongest sentiment prediction capacity\n", "import re\n", "import string\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Remove stop words, which carry little meaning, and store the stemmed words of positive and negative reviews separately\n", "stop = set(stopwords.words('english')) \n", "sno = nltk.stem.SnowballStemmer('english') \n", "def cleanhtml(sentence): \n", "    cleanr = re.compile('<.*?>')\n", "    cleantext = re.sub(cleanr, ' ', sentence)\n", "    return cleantext\n", "def cleanpunc(sentence): \n", "    cleaned = re.sub(r'[?|!|\\'|\"|#]',r'',sentence)\n", "    cleaned = re.sub(r'[.|,|)|(|\\|/]',r' ',cleaned)\n", "    return cleaned\n", "i=0\n", "str1=' '\n", "final_string=[]\n", "all_positive_words=[] \n", "all_negative_words=[] \n", "s=''\n", "for sent in data['review'].values:\n", "    filtered_sentence=[]\n", "    sent=cleanhtml(sent) \n", "    for w in sent.split():\n", "        for cleaned_words in cleanpunc(w).split():\n", "            if cleaned_words.isalpha() and len(cleaned_words)>2: \n", "                if(cleaned_words.lower() not in stop):\n", "                    s=(sno.stem(cleaned_words.lower())).encode('utf8')\n", "                    filtered_sentence.append(s)\n", "                    if (data['score'].values)[i] == 1: \n", "
                        all_positive_words.append(s) \n", "                    if(data['score'].values)[i] == 0:\n", "                        all_negative_words.append(s) \n", "                else:\n", "                    continue\n", "            else:\n", "                continue \n", "    \n", "    str1 = b\" \".join(filtered_sentence) \n", "    \n", "    final_string.append(str1)\n", "    i+=1" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3062885\n", "3002812\n" ] } ], "source": [ "total_positive_words = len(all_positive_words)\n", "total_negative_words = len(all_negative_words)\n", "print(total_positive_words)\n", "print(total_negative_words)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "import random\n", "from collections import Counter\n", "# Count word frequencies in a random sample of 10,000 words per class\n", "apw = random.sample(all_positive_words, 10000)\n", "anw = random.sample(all_negative_words, 10000)\n", "freq_negative_words = dict(Counter(anw))\n", "freq_positive_words = dict(Counter(apw))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's see the positive sentiment words first" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ "     positive_words  probability\n", "16           b'thi'     0.000070\n", "2           b'film'     0.000049\n", "115         b'movi'     0.000047\n", "127         b'like'     0.000027\n", "52           b'one'     0.000026\n", "341        b'stori'     0.000017\n", "263          b'see'     0.000017\n", "49          b'time'     0.000016\n", "283        b'scene'     0.000016\n", "201         b'make'     0.000016\n", "69          b'veri'     0.000015\n", "93         b'watch'     0.000015\n", "71         b'great'     0.000013\n", "27          b'love'     0.000013\n", "135         b'well'     0.000013\n", "223      b'charact'     0.000012\n", "199         b'good'     0.000012\n", "174          b'get'     0.000012\n", "169         b'also'     0.000012\n", "289         b'play'     0.000011" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note: the counts come from the 10,000-word sample but are divided by the full positive word count,\n", "# so these values are scaled-down relative frequencies; the ranking is unaffected.\n", "lst=[]\n", "for key in freq_positive_words:\n", "    prob = freq_positive_words[key]/total_positive_words\n", "    lst.append([key,prob])\n", "table_positive = pd.DataFrame(lst,columns=['positive_words','probability'])\n", "table_positive = table_positive.sort_values('probability', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')\n", "table_positive.head(20)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{b'thi': 214,\n", " b'film': 149,\n", " b'movi': 143,\n", " b'like': 83,\n", " b'one': 80,\n", " b'stori': 52,\n", " b'see': 51,\n", " b'time': 50,\n", " b'make': 48,\n", " b'scene': 48,\n", " b'veri': 47}" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Collect the most frequent positive words (the i>10 check keeps 11 entries, as the output shows)\n", "from operator import itemgetter\n", "posi={}\n", "i=0\n", "for key, value in sorted(freq_positive_words.items(), key = itemgetter(1), reverse = True):\n", "    if(i>10):\n", "        break\n", "    posi[key]=value\n", "    i+=1\n", "posi" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 words that predicts positive sentiment\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.bar(range(len(posi)), list(posi.values()), align='center')\n", "plt.xticks(range(len(posi)), list(posi.keys()))\n", "\n", "print(\"Top 10 words that predicts positive sentiment\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ "      negative_words  probability\n", "30            b'thi'     0.000084\n", "142          b'movi'     0.000056\n", "17           b'film'     0.000041\n", "16           b'like'     0.000029\n", "250           b'one'     0.000026\n", "2            b'even'     0.000021\n", "177       b'charact'     0.000017\n", "0             b'get'     0.000016\n", "15          b'watch'     0.000016\n", "56           b'look'     0.000014\n", "22          b'would'     0.000014\n", "163          b'good'     0.000014\n", "4             b'ani'     0.000014\n", "282          b'make'     0.000013\n", "59          b'stori'     0.000013\n", "391        b'becaus'     0.000012\n", "68          b'scene'     0.000012\n", "419           b'act'     0.000012\n", "1003        b'peopl'     0.000012\n", "246        b'realli'     0.000012" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's see the most frequent negative sentiment words.\n", "# Note: the counts come from the 10,000-word sample but are divided by the full negative word count,\n", "# so these values are scaled-down relative frequencies; the ranking is unaffected.\n", "lst=[]\n", "for key in freq_negative_words:\n", "    prob = freq_negative_words[key]/total_negative_words\n", "    lst.append([key,prob])\n", "table_negative = pd.DataFrame(lst,columns=['negative_words','probability'])\n", "table_negative = table_negative.sort_values('probability', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')\n", "table_negative.head(20)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{b'thi': 253,\n", " b'movi': 168,\n", " b'film': 124,\n", " b'like': 88,\n", " b'one': 78,\n", " b'even': 63,\n", " b'charact': 51,\n", " b'get': 49,\n", " b'watch': 47,\n", " b'ani': 41,\n", " b'would': 41}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Collect the most frequent negative words (the i>10 check keeps 11 entries, as the output shows)\n", "nega={}\n", "i=0\n", "for key, value in sorted(freq_negative_words.items(), key = itemgetter(1), reverse = True):\n", "    if(i>10):\n", "        break\n", "    nega[key]=value\n", "    i+=1\n", "nega" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 words that predicts negative sentiment\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.bar(range(len(nega)), list(nega.values()), align='center')\n", "plt.xticks(range(len(nega)), list(nega.keys()))\n", "\n", "print(\"Top 10 words that predicts negative sentiment\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# f.\tUsing the test dataset\n", "\tUse the optimal hyperparameters you found in the step e, and use it to calculate the final accuracy. \n" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "optimal_alpha_bnb\n", "\n", "# For alpha = 1, we get the minimum misclassification error" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultinomialNB(alpha=1, class_prior=None, fit_prior=True)" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now fit the Naive Bayes model with the optimal alpha\n", "clf = MultinomialNB(alpha=1)\n", "clf.fit(tf_idf_train,y_train)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "y_pred_test = clf.predict(tf_idf_test)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "****Test accuracy is 88.98\n" ] } ], "source": [ "# Accuracy on the held-out test set, in percent\n", "from sklearn.metrics import accuracy_score\n", "acc = accuracy_score(y_test, y_pred_test, normalize=True) * float(100)\n", "print('\\n****Test accuracy is',(acc))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Now let's plot the confusion matrix to visualize the classifier's performance on the test set\n", "import seaborn as sns\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn import metrics\n", "cm_test = confusion_matrix(y_test,y_pred_test)\n", "cm_test\n", "sns.heatmap(cm_test,annot=True,fmt='d')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "****Train accuracy is 96%\n" ] } ], "source": [ "# Now let's see the train accuracy (%d truncates it to a whole percent)\n", "y_pred_train = clf.predict(tf_idf_train)\n", "acc = accuracy_score(y_train, y_pred_train, normalize=True) * float(100)\n", "print('\\n****Train accuracy is %d%%' % (acc))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Confusion matrix on the training set\n", "cm_train = confusion_matrix(y_train,y_pred_train)\n", "cm_train\n", "sns.heatmap(cm_train,annot=True,fmt='d')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# With the best hyperparameter alpha=1, we get a test accuracy of 88.98% and a train accuracy of 96%, a gap of about 7 points. The confusion matrices show that most reviews are predicted correctly, with a smaller number of wrong predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "01. https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews\n", "02. https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184\n", "03. https://www.dataquest.io/blog/naive-bayes-tutorial/\n", "04. https://levelup.gitconnected.com/movie-review-sentiment-analysis-with-naive-bayes-machine-learning-from-scratch-part-v-7bb869391bab\n", "05. https://medium.com/@krsatyam1996/imdb-movie-review-polarity-using-naive-bayes-classifier-9f92c13efa2d\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }